{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By deploying or using this software you agree to comply with the [AI Hub Terms of Service]( https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoogleCloudPlatform/ai-platform-samples/blob/master/notebooks/samples/aihub/xgboost_classification/xgboost_classification.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tvgnzT1CKxrO"
   },
   "source": [
    "# Overview\n",
    "\n",
    "This notebook provides an example workflow of using the [Distributed XGBoost ML container](https://aihub.cloud.google.com/u/0/p/products%2F0ccd8a63-71a7-4e48-a68b-685692a62e92) for training a classification ML model.\n",
    "\n",
    "### Dataset\n",
    "\n",
    "The notebook uses the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). It consists of 3 different types of irises (Setosa, Versicolour, and Virginica) petal and sepal length, stored in a 150x5 table.\n",
    "\n",
    "### Objective\n",
    "\n",
    "The goal of this notebook is to go through a common training workflow:\n",
    "- Create a dataset\n",
    "- Train an ML model using the [AI Platform Training](https://cloud.google.com/ai-platform/training/docs) service\n",
    "- Monitor the training job with [TensorBoard](https://www.tensorflow.org/tensorboard)\n",
    "- Identify if the model was trained successfully by looking at the generated \"Run Report\"\n",
    "- Deploy the model for serving using the [AI Platform Prediction](https://cloud.google.com/ai-platform/prediction/docs/overview) service\n",
    "- Use the endpoint for online predictions\n",
    "- Interactively inspect the deployed ML model with the [What-If Tool](https://pair-code.github.io/what-if-tool/) \n",
    "\n",
    "### Costs \n",
    "\n",
    "This tutorial uses billable components of Google Cloud Platform (GCP):\n",
    "\n",
    "* Cloud AI Platform\n",
    "* Cloud Storage\n",
    "\n",
    "Learn about [Cloud AI Platform\n",
    "pricing](https://cloud.google.com/ml-engine/docs/pricing) and [Cloud Storage\n",
    "pricing](https://cloud.google.com/storage/pricing), and use the [Pricing\n",
    "Calculator](https://cloud.google.com/products/calculator/)\n",
    "to generate a cost estimate based on your projected usage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ze4-nDLfK4pw"
   },
   "source": [
    "### Set up your local development environment\n",
    "\n",
    "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n",
    "all the requirements to run this notebook. You can skip this step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "gCuSR8GkAgzl"
   },
   "source": [
    "**Otherwise**, make sure your environment meets this notebook's requirements.\n",
    "You need the following:\n",
    "\n",
    "* The Google Cloud SDK\n",
    "* Git\n",
    "* Python 3\n",
    "* virtualenv\n",
    "* Jupyter notebook running in a virtual environment with Python 3\n",
    "\n",
    "The Google Cloud guide to [Setting up a Python development\n",
    "environment](https://cloud.google.com/python/setup) and the [Jupyter\n",
    "installation guide](https://jupyter.org/install) provide detailed instructions\n",
    "for meeting these requirements. The following steps provide a condensed set of\n",
    "instructions:\n",
    "\n",
    "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n",
    "\n",
    "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n",
    "\n",
    "3. [Install\n",
    "   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n",
    "   and create a virtual environment that uses Python 3.\n",
    "\n",
    "4. Activate that environment and run `pip install jupyter` in a shell to install\n",
    "   Jupyter.\n",
    "\n",
    "5. Run `jupyter notebook` in a shell to launch Jupyter.\n",
    "\n",
    "6. Open this notebook in the Jupyter Notebook Dashboard."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "BF1j6f9HApxa"
   },
   "source": [
    "### Set up your GCP project\n",
    "\n",
    "**The following steps are required, regardless of your notebook environment.**\n",
    "\n",
    "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
    "\n",
    "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
    "\n",
    "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
    "\n",
    "4. Enter your project ID in the cell below. Then run the  cell to make sure the\n",
    "Cloud SDK uses the right project for all the commands in this notebook.\n",
    "\n",
    "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "oM1iC_MfAts1"
   },
   "outputs": [],
   "source": [
    "PROJECT_ID = \"[your-project-id]\" #@param {type:\"string\"}\n",
    "! gcloud config set project $PROJECT_ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "dr--iN2kAylZ"
   },
   "source": [
    "### Authenticate your GCP account\n",
    "\n",
    "**If you are using AI Platform Notebooks**, your environment is already\n",
    "authenticated. Skip this step."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "sBCra4QMA2wR"
   },
   "source": [
    "**If you are using Colab**, run the cell below and follow the instructions\n",
    "when prompted to authenticate your account via oAuth.\n",
    "\n",
    "**Otherwise**, follow these steps:\n",
    "\n",
    "1. In the GCP Console, go to the [**Create service account key**\n",
    "   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n",
    "\n",
    "2. From the **Service account** drop-down list, select **New service account**.\n",
    "\n",
    "3. In the **Service account name** field, enter a name.\n",
    "\n",
    "4. From the **Role** drop-down list, select\n",
    "   **Machine Learning Engine > AI Platform Admin** and\n",
    "   **Storage > Storage Object Admin**.\n",
    "\n",
    "5. Click *Create*. A JSON file that contains your key downloads to your\n",
    "local environment.\n",
    "\n",
    "6. Enter the path to your service account key as the\n",
    "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "PyQmSRbKA8r-"
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "# If you are running this notebook in Colab, run this cell and follow the\n",
    "# instructions to authenticate your GCP account. This provides access to your\n",
    "# Cloud Storage bucket and lets you submit training jobs and prediction\n",
    "# requests.\n",
    "\n",
    "if 'google.colab' in sys.modules:\n",
    "  from google.colab import auth as google_auth\n",
    "  google_auth.authenticate_user()\n",
    "\n",
    "# If you are running this notebook locally, replace the string below with the\n",
    "# path to your service account key and run this cell to authenticate your GCP\n",
    "# account.\n",
    "else:\n",
    "  %env GOOGLE_APPLICATION_CREDENTIALS ''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zgPO1eR3CYjk"
   },
   "source": [
    "### Create a Cloud Storage bucket\n",
    "\n",
    "**The following steps are required, regardless of your notebook environment.**\n",
    "\n",
    "You need to have a \"workspace\" bucket that will hold the dataset and the output from the ML Container. Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets. \n",
    "\n",
    "You may also change the `REGION` variable, which is used for operations\n",
    "throughout the rest of this notebook. Make sure to [choose a region where Cloud AI Platform services are available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may not use a Multi-Regional Storage bucket for training with AI Platform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code",
    "id": "MzGDU7TWdts_"
   },
   "outputs": [],
   "source": [
    "BUCKET_NAME = \"[your-bucket-name]\" #@param {type:\"string\"}\n",
    "REGION = 'us-central1' #@param {type:\"string\"}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-EcIXiGsCePi"
   },
   "source": [
    "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "NIq7R4HZCfIc"
   },
   "outputs": [],
   "source": [
    "! gsutil mb -l $REGION gs://$BUCKET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ucvCsknMCims"
   },
   "source": [
    "Finally, validate access to your Cloud Storage bucket by examining its contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vhOb7YnwClBb"
   },
   "outputs": [],
   "source": [
    "! gsutil ls -al gs://$BUCKET_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "i7EUnXsZhAGF"
   },
   "source": [
    "## PIP Install Packages and dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "wyy5Lbnzg5fi"
   },
   "outputs": [],
   "source": [
    "! pip install witwidget"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XoEqT2Y4DJmf"
   },
   "source": [
    "### Import libraries and define constants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "pRUOFELefqf1"
   },
   "outputs": [],
   "source": [
    "from __future__ import absolute_import\n",
    "from __future__ import division\n",
    "from __future__ import print_function\n",
    "\n",
    "import os\n",
    "import time\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from IPython.core.display import HTML\n",
    "from googleapiclient import discovery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load Iris dataset\n",
    "iris = datasets.load_iris()\n",
    "names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n",
    "data = pd.DataFrame(iris.data, columns=names)\n",
    "\n",
    "# add target\n",
    "data['target'] = iris.target\n",
    "\n",
    "# split\n",
    "training, validation = train_test_split(data, test_size=50, stratify=data['target'])\n",
    "\n",
    "# standardization\n",
    "training_targets = training.pop('target')\n",
    "validation_targets = validation.pop('target')\n",
    "\n",
    "data_mean = training.mean(axis=0)\n",
    "data_std = training.std(axis=0)\n",
    "training = (training - data_mean) / data_std\n",
    "training['target'] = training_targets\n",
    "\n",
    "validation = (validation - data_mean) / data_std\n",
    "validation['target'] = validation_targets\n",
    "\n",
    "print('Training data head')\n",
    "display(training.head())\n",
    "\n",
    "training_data = os.path.join('gs://', BUCKET_NAME, 'data/train.csv')\n",
    "validation_data = os.path.join('gs://', BUCKET_NAME, 'data/valid.csv')\n",
    "\n",
    "print('Copy the data in bucket ...')\n",
    "with tf.io.gfile.GFile(training_data, 'w') as f:\n",
    "  training.to_csv(f, index=False)\n",
    "with tf.io.gfile.GFile(validation_data, 'w') as f:\n",
    "  validation.to_csv(f, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cloud training\n",
    "\n",
    "### Accelerator and distribution support\n",
    "\n",
    "| GPU | Multi-GPU Node | TPU | Workers | Parameter Server |\n",
    "|---|---|---|---|---|\n",
    "| Yes | No | No | Yes | No |\n",
    "\n",
    "To have distribution and/or accelerators to your AI Platform training call, use parameters similar to the examples as shown below.\n",
    "```bash\n",
    "    --master-machine-type standard_gpu\n",
    "    --worker-machine-type standard_gpu \\\n",
    "    --worker-count 2 \\\n",
    "```\n",
    "### AI Platform training\n",
    "\n",
    "- [Distributed XGBoost ML container documentation](https://aihub.cloud.google.com/u/0/p/products%2F0ccd8a63-71a7-4e48-a68b-685692a62e92).\n",
    "- [AI Platform training documentation](https://cloud.google.com/sdk/gcloud/reference/ai-platform/jobs/submit/training)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_location = os.path.join('gs://', BUCKET_NAME, 'output')\n",
    "\n",
    "job_name = \"xgboost_classification_{}\".format(time.strftime(\"%Y%m%d%H%M%S\"))\n",
    "!gcloud ai-platform jobs submit training $job_name \\\n",
    "    --master-image-uri gcr.io/aihub-c2t-containers/kfp-components/trainer/dist_xgboost:latest \\\n",
    "    --region $REGION \\\n",
    "    --scale-tier CUSTOM \\\n",
    "    --master-machine-type standard \\\n",
    "    -- \\\n",
    "    --output-location {output_location} \\\n",
    "    --training-data {training_data} \\\n",
    "    --validation-data {validation_data} \\\n",
    "    --target-column target \\\n",
    "    --data-type csv \\\n",
    "    --number-of-classes 3 \\\n",
    "    --fresh-start True \\\n",
    "    --objective multi:softprob \\"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Local training snippet\n",
    "\n",
    "Note that the training can also be done locally with Docker\n",
    "```bash\n",
    "docker run \\\n",
    "    -v /tmp:/tmp \\\n",
    "    -it gcr.io/aihub-c2t-containers/kfp-components/trainer/dist_xgboost:latest \\\n",
    "    --output-location /tmp/fm_classification \\\n",
    "    --training-data /tmp/iris_train.csv \\\n",
    "    --validation-data /tmp/iris_valid.csv \\\n",
    "    --target-column target \\\n",
    "    --data-type csv \\\n",
    "    --number-of-classes 3 \\\n",
    "    --objective multi:softprob\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect the Run Report\n",
    "\n",
    "The \"Run Report\" will help you identify if the model was successfully trained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not tf.io.gfile.exists(os.path.join(output_location, 'report.html')):\n",
    "  raise RuntimeError('The file report.html was not found. Did the training job finish?')\n",
    "\n",
    "with tf.io.gfile.GFile(os.path.join(output_location, 'report.html')) as f:\n",
    "  display(HTML(f.read()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deployment parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#@markdown ---\n",
    "model = 'xgboost_iris' #@param {type:\"string\"}\n",
    "version = 'v2' #@param {type:\"string\"}\n",
    "#@markdown ---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy the model for serving\n",
    "- https://cloud.google.com/ai-platform/prediction/docs/deploying-models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the exact location of the model is in model_uri.txt\n",
    "with tf.io.gfile.GFile(os.path.join(output_location, 'model_uri.txt')) as f:\n",
    "  model_uri = f.read().replace('/model.bst', '')\n",
    "\n",
    "# create a model\n",
    "! gcloud ai-platform models create $model --region $REGION\n",
    "\n",
    "# create a version\n",
    "! gcloud ai-platform versions create $version --region $REGION \\\n",
    "  --model $model \\\n",
    "  --runtime-version 1.15 \\\n",
    "  --origin $model_uri \\\n",
    "  --framework XGBOOST \\\n",
    "  --project $PROJECT_ID"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use the endpoint for online predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# format the data for serving\n",
    "instances = validation.drop(columns='target').values.tolist()\n",
    "validation_targets = validation['target']\n",
    "display(instances[:2])\n",
    "\n",
    "service = discovery.build('ml', 'v1')\n",
    "name = 'projects/{project}/models/{model}/versions/{version}'.format(project=PROJECT_ID,\n",
    "                                                                     model=model,\n",
    "                                                                     version=version)\n",
    "body = {'instances': instances}\n",
    "\n",
    "response = service.projects().predict(name=name, body=body).execute()\n",
    "if 'error' in response:\n",
    "    raise RuntimeError(response['error'])\n",
    "\n",
    "class_probabilties = [row for row in response['predictions']]\n",
    "predicted_classes = np.array(class_probabilties).argmax(axis=1)\n",
    "accuracy = (predicted_classes == validation_targets).mean()\n",
    "print('Accuracy of the predictions: {}'.format(accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect the ML model\n",
    "\n",
    "- [What if tool home page](https://pair-code.github.io/what-if-tool/index.html)\n",
    "- [Installation](https://github.com/pair-code/what-if-tool#how-do-i-enable-it-for-use-in-a-jupyter-notebook)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import witwidget\n",
    "from witwidget.notebook.visualization import WitWidget, WitConfigBuilder\n",
    "\n",
    "config_builder = WitConfigBuilder(examples=validation.values.tolist(),\n",
    "                                  feature_names=validation.columns.tolist())\n",
    "config_builder.set_ai_platform_model(project=PROJECT_ID,\n",
    "                                     model=model,\n",
    "                                     version=version)\n",
    "config_builder.set_model_type('classification')\n",
    "config_builder.set_target_feature('target')\n",
    "WitWidget(config_builder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TpV-iwP9qw9c"
   },
   "source": [
    "# Cleaning up\n",
    "\n",
    "To clean up all GCP resources used in this project, you can [delete the GCP\n",
    "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "sx_vKniMq9ZX"
   },
   "outputs": [],
   "source": [
    "# Delete model version resource\n",
    "! gcloud ai-platform versions delete $version --quiet --model $model \n",
    "\n",
    "# Delete model resource\n",
    "! gcloud ai-platform models delete $model --quiet\n",
    "\n",
    "# If training job is still running, cancel it\n",
    "! gcloud ai-platform jobs cancel $job_name --quiet\n",
    "\n",
    "# Delete Cloud Storage objects that were created\n",
    "! gsutil -m rm -r $BUCKET_NAME"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
